Your chat bot needs a better rate limit strategy

I’m on a project where we’re connecting to OpenAI to build a chat bot.

Because OpenAI’s rate limits are more complicated than those of most other APIs, we proactively avoid hitting them by keeping track of how many tokens we’re using. However, this approach is incomplete: it does nothing to limit an individual user’s usage.

A chat bot feature is unique in that it can quickly exhaust your organization’s rate limits across multiple dimensions: requests per minute (RPM) and tokens per minute (TPM). This is because…

  • Chatting enables the user to send messages in rapid succession, which increases RPM.
  • Messages can be long, contain attachments, and as the conversation grows, so does the context window, which increases TPM.

Once a user hits the organization’s limit, everything that uses OpenAI is affected and stops working. In the case of my project, that means not only would the chat bot feature go down, but so would the other tools we use that leverage OpenAI.

In short, one user can trigger a denial of service simply by using a feature as it was intended to be used.

Fortunately, there are a few solutions to this problem.

Our base

For the sake of this demonstration, we’ll use RubyLLM, but the concepts we’ll learn apply to any implementation and platform.

chat = RubyLLM.chat

response = chat.ask "What's a good rate limit strategy for a chat bot?"

puts response.content

Rate limit messages

The first thing we can do is limit the number of messages a user can send in a given time frame. This is known as “fixed-window rate limiting”. Although Rails ships with a rate limit mechanism, that won’t help us when we need to rate limit token usage. Instead, we can rely on a cache store like Redis, since its built-in INCR and EXPIRE APIs lend themselves well to a rate limiting mechanism.
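Before wiring this into Rails, the fixed-window idea itself is small enough to sketch in plain Ruby. This is an illustration only: an in-memory Hash stands in for Redis’s INCR and EXPIRE, and the FixedWindow class name is hypothetical.

```ruby
# Illustrative in-memory fixed-window counter. In production the counter
# and its expiry live in Redis (INCR + EXPIRE) so every process shares them.
class FixedWindow
  def initialize(limit:, window: 60)
    @limit = limit
    @window = window # seconds
    @counters = {}   # key => [window_started_at, count]
  end

  # Returns true if the request is allowed, false once the window is full.
  def allow?(key, now: Time.now)
    started_at, count = @counters[key]

    # Start a fresh window if none exists or the previous one has expired.
    if started_at.nil? || now - started_at >= @window
      @counters[key] = [now, 1]
      return true
    end

    return false if count >= @limit

    @counters[key] = [started_at, count + 1]
    true
  end
end

limiter = FixedWindow.new(limit: 3)
4.times.map { limiter.allow?("user:1") } # => [true, true, true, false]
```

A shared store like Redis is still essential in production, since an in-process Hash can’t coordinate limits across multiple app servers — which is exactly what the Usage class below delegates to its cache store.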

# app/models/usage.rb
class Usage
  MAX_RPM = ENV.fetch("USAGE_MAX_RPM", 10).to_i

  def initialize(user, cache_store = ActiveSupport::Cache::RedisCacheStore.new)
    @user = user
    @cache_store = cache_store
  end

  def track!
    track_rpm!
  end

  def exceeded?
    rpm_exceeded?
  end

  private

  attr_reader :user, :cache_store

  def rpm_key
    "usage:user:#{user.id}:rpm"
  end

  def track_rpm!
    cache_store.increment(rpm_key, 1, expires_in: 1.minute)
  end

  def rpm_exceeded?
    cache_store.read(rpm_key, raw: true).to_i >= MAX_RPM
  end
end

# app/models/user.rb
class User < ApplicationRecord
  def usage
    Usage.new(self)
  end
end

We can create a simple object that tracks a user’s requests per minute by incrementing a counter for each request made, making sure to expire the key after one minute.

Here’s how that might look when used with our chat bot:

chat = RubyLLM.chat
user = User.last!
usage = user.usage

unless usage.exceeded?
  response = chat.ask "What's a good rate limit strategy for a chat bot?"

  usage.track!

  puts response.content
else
  # Alert user that they've exceeded their usage.
end

Before making a request, we check to see if the user has exhausted their individual rate limit. If not, we make the request and track the usage.

Limit token usage

Limiting requests solves only half of the problem, since token usage is also a metric that requires rate limiting.

We can’t just validate the length of the message, since token usage doesn’t map 1:1 to character length. Token usage also includes output tokens, which we can’t know until after the request completes.

Fortunately, OpenAI returns usage data in the response, which RubyLLM exposes in a Message instance.

 class Usage
   MAX_RPM = ENV.fetch("USAGE_MAX_RPM", 10).to_i
+  MAX_TPM = ENV.fetch("USAGE_MAX_TPM", 10_000).to_i

   def initialize(user, cache_store = ActiveSupport::Cache::RedisCacheStore.new)
     @user = user
     @cache_store = cache_store
   end

-  def track!
-    track_rpm!
+  def track!(total_tokens)
+    [ track_rpm!, track_tpm!(total_tokens) ]
   end

   def exceeded?
-    rpm_exceeded?
+    rpm_exceeded? || tpm_exceeded?
   end

   private
@@ -29,4 +30,16 @@ class Usage
   def rpm_exceeded?
     cache_store.read(rpm_key, raw: true).to_i >= MAX_RPM
   end
+
+  def tpm_key
+    "usage:user:#{user.id}:tpm"
+  end
+
+  def track_tpm!(total_tokens)
+    cache_store.increment(tpm_key, total_tokens, expires_in: 1.minute)
+  end
+
+  def tpm_exceeded?
+    cache_store.read(tpm_key, raw: true).to_i >= MAX_TPM
+  end
 end

We can use the same pattern we used for tracking requests to track tokens. The only difference is that we need to supply the token count ourselves.

Here’s how that would look in our chat bot:

 unless usage.exceeded?
   response = chat.ask "What's a good rate limit strategy for a chat bot?"

-  usage.track!
+  tokens = response.tokens
+  total_tokens = tokens.input + tokens.output + tokens.thinking
+
+  usage.track!(total_tokens)

   puts response.content
 else

After making a request, we extract the token usage from the response and pass it to our Usage instance, where it counts toward the current one-minute window.

Calculating per-user rate limits

Since the rate limits set by OpenAI and other providers are at the organization level, how might we evenly distribute those values on a per-user basis?

A simple approach suitable for most early-stage products is to divide the organization limit by the expected number of concurrent users. To leave headroom for unexpected spikes in traffic, we can multiply by a buffer factor below 1.

Per-user limit = (Organization limit × buffer) / Expected concurrent users

Below is what that looks like using the values in this demonstration.

Metric    Org Limit    Concurrent Users    Buffer    Per-user Limit
RPM       125          10                  0.8       (125 × 0.8) / 10 = 10
TPM       125,000      10                  0.8       (125,000 × 0.8) / 10 = 10,000
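As a sanity check, the arithmetic above can be expressed directly in Ruby (per_user_limit is a hypothetical helper name; the numbers are the demonstration values from the table):

```ruby
# Evenly distribute an organization-level limit across expected concurrent
# users, reserving headroom with a buffer factor below 1.
def per_user_limit(org_limit, concurrent_users, buffer: 0.8)
  ((org_limit * buffer) / concurrent_users).floor
end

per_user_limit(125, 10)     # => 10     (matches USAGE_MAX_RPM)
per_user_limit(125_000, 10) # => 10_000 (matches USAGE_MAX_TPM)
```

Flooring keeps the per-user limit conservative when the division isn’t exact.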

Additional considerations

The core problem is that a chat bot is inherently expensive. If we stick to these numbers, we’re constrained to 10 concurrent users, which feels pretty limited, even for an early-stage product.

One solution is to queue the requests rather than reject them, which might look something like this:

class ProcessPromptJob < ApplicationJob
  queue_as :default

  MAX_ATTEMPTS = 2

  rescue_from(Usage::RateLimitExhaustedError) do
    if executions < MAX_ATTEMPTS
      retry_job wait: 15.seconds
    else
      # Broadcast failure message to user
    end
  end

  def perform(user, prompt)
    raise Usage::RateLimitExhaustedError if user.usage.exceeded?

    chat = RubyLLM.chat

    response = chat.ask(prompt)

    tokens = response.tokens
    total_tokens = tokens.input + tokens.output + tokens.thinking

    user.usage.track!(total_tokens)

    # Broadcast LLM response to user
  end
end

To support this, we define the error class on Usage:

 class Usage
+  class RateLimitExhaustedError < StandardError; end
+
   MAX_RPM = ENV.fetch("USAGE_MAX_RPM", 10).to_i
   MAX_TPM = ENV.fetch("USAGE_MAX_TPM", 10_000).to_i

The flow would be something like this:

User sends message
  └─▶ Show loading state
        └─▶ Enqueue background job
              └─▶ Rate limit exceeded?
                    ├─ No  ──▶ Process request ──▶ Broadcast response
                    └─ Yes ──▶ Wait 15 seconds
                                 └─▶ Rate limit exceeded?
                                       ├─ No  ──▶ Process request ──▶ Broadcast response
                                       └─ Yes ──▶ Broadcast failure

Wrapping up

Limiting a user’s application usage is not a new problem, but it’s more relevant today with the rise of chat bot features.

The examples above highlight simple solutions, but they’re just a start — a more nuanced approach will likely be needed.

For example, if a user hits their limit, do you attempt to upsell them? Do you block them outright? Do you fall back to a cheaper model? Maybe a combination of these?

Regardless, these decisions should involve stakeholders such as designers and the product team.
